Modified character-level deciphering algorithm for OCR in degraded documents

Authors

Chi Fang

Jonathan J. Hull

Abstract

Modifications to a previous character-level deciphering algorithm for OCR are presented in this paper that are able to handle touching characters and are tolerant to mistakes made at the clustering stage. The objective of a character-level deciphering algorithm is to assign alphabetic identities to character patterns such that the character repetition pattern in an input text matches the letter repetition pattern provided by a language model. Degradation in document images usually causes the occurrence of touching characters and mistakes in clustering the character patterns, which pose difficulties for character-level deciphering algorithms. The modifications proposed in this paper tightly integrate visual constraints from characters and touching patterns with constraints from a language model. This solves the problem of touching characters and reverses clustering mistakes. This provides a deciphering algorithm with robust performance under image degradation.
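
To make the idea of matching repetition patterns concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that clustering has already assigned an integer ID to each character image in a word, and it uses a tiny hypothetical word list (LEXICON) in place of a real language model. The function names repetition_pattern and candidate_words are illustrative only. The sketch ignores touching characters and clustering mistakes, which are exactly the cases the paper's modifications address.

```python
def repetition_pattern(symbols):
    """Encode a sequence by the order in which each distinct symbol first appears.

    Two sequences share a pattern exactly when one can be turned into the
    other by a one-to-one renaming of symbols.
    """
    first_seen = {}
    pattern = []
    for symbol in symbols:
        if symbol not in first_seen:
            first_seen[symbol] = len(first_seen)
        pattern.append(first_seen[symbol])
    return tuple(pattern)


def candidate_words(cluster_ids, lexicon):
    """Return lexicon words whose letter repetition pattern matches the clusters."""
    target = repetition_pattern(cluster_ids)
    return [word for word in lexicon if repetition_pattern(word) == target]


# Hypothetical stand-in for a language model: a very small word list.
LEXICON = ["letter", "better", "bottle", "people", "little"]

# Hypothetical cluster IDs for the six character images of one word image.
clusters = [7, 3, 9, 9, 3, 4]

matches = candidate_words(clusters, LEXICON)
print(matches)
```

A single word usually leaves several candidates open. Intersecting the cluster-to-letter assignments implied by many words in the same text is what narrows the mapping down, and that is where constraints from a language model do the real work.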

Related resources

Performance Evaluation of Two Arabic OCR Products

Numerous Optical Character Recognition (OCR) companies claim that their products have near-perfect recognition accuracy (close to 99.9%). In practice, however, these accuracy rates are rarely achieved. Most systems break down when the input document images are highly degraded, such as scanned images of carbon-copy documents, documents printed on low-quality paper, and documents that are n-th ge...

Degraded Document Analysis and Extraction of Original Text Document: An Approach without Optical Character Recognition

Document Image Analysis recognizes text and graphics in documents acquired as images. An approach without Optical Character Recognition (OCR) for degraded document image analysis has been adopted in this paper. The technique involves document imaging methods such as Image Fusing and Speeded Up Robust Features (SURF) Detection to identify and extract the degraded regions from a set of document i...

Algorithms for postprocessing OCR results with visual inter-word constraints

Algorithms are presented that determine the visual relationships between word images in a document. These include instances of common word images and common substrings that occur often in English language text images. This information is then used to improve the performance of a commercial optical character recognition (OCR) algorithm. The algorithms presented here calculate clusters of equi...

OCR of Degraded Documents using HMM-Based Techniques

We present an OCR system for handling degraded documents, such as faxed text. The basic system utilizes the BBN BYBLOS OCR system, which uses a Hidden Markov Model (HMM) approach for training and recognition. To handle degraded documents, we present two approaches, which can be applied individually or jointly. In the first approach, we train the system on documents that exhibit the expected kin...

Extraction of Original Text Document from a Set of Degraded Text Documents from the Same Source

Information extraction is the task of extracting structured data from a degraded document. It includes extracting data such as text, images, or graphics from sources such as an image, a video, or documents. Text detection and extraction from degraded documents finds application in a wide range of studies. In this paper, an Optical Character Recognition-less (OCR-less) method of obtaining an origi...

Publication date: 1995
